DOCUMENT RESUME

ED 376 216                                                  TM 022 355

AUTHOR          Spray, Judith; Miller, Tim
TITLE           Identifying Nonuniform DIF in Polytomously Scored
                Test Items. ACT Research Report Series 94-1.
INSTITUTION     American Coll. Testing Program, Iowa City, Iowa.
PUB DATE        Jun 94
NOTE            21p.
AVAILABLE FROM  ACT Research Report Series, P.O. Box 168, Iowa City,
                IA 52243.
PUB TYPE        Reports - Research/Technical (143)

EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     Computer Simulation; *Identification; *Item Bias;
                *Sample Size; *Scoring; Statistical Analysis; *Test
                Items
IDENTIFIERS     *Logistic Discriminant Function Analysis; Mantel
                Haenszel Procedure; *Polytomous Scoring

ABSTRACT 

Computer simulations under three conditions of 
polytomous differential item functioning (DIF) compared the ability 
of three different statistical procedures to detect nonuniform DIF. 
The procedures were a nominal and an ordinal extension of the 
Mantel-Haenszel statistic, and logistic discriminant function 
analysis. Results showed that only the logistic discriminant function 
analysis could detect all types of nonuniform DIF simulated when 
sample sizes were moderate to large (i.e., N > 500). This procedure 
is recommended when nonuniform DIF identification is required. 
Contains 2 tables, 3 figures, and 10 references. (Author/SLD) 



Reproductions supplied by EDRS are the best that can be made
from the original document.



ACT Research Report Series 



94-1 






Identifying Nonuniform DIF in 
Polytomously Scored Test Items 



Judith Spray 
Tim Miller 



U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.
Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official
OERI position or policy.



"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC)." 



June 1994 






For additional copies write: 
ACT Research Report Series 
P.O. Box 168 
Iowa City, Iowa 52243 



© 1994 by The American College Testing Program. All rights reserved.



Identifying Nonuniform DIF in Polytomously Scored Test Items 



Judy Spray 
Tim Miller 






Abstract 

Computer simulations under three conditions of polytomous DIF compared the ability of 
three different statistical procedures to detect nonuniform DIF. The procedures were a nominal 
and an ordinal extension of the Mantel-Haenszel statistic, and logistic discriminant function 
analysis. Results showed that only the logistic discriminant function analysis could detect all 
types of nonuniform DIF simulated when sample sizes were moderate-to-large (i.e., N > 500). 
This procedure is recommended when nonuniform DIF identification is required. 



Identifying Nonuniform DIF in Polytomously Scored Test Items 

The use of polytomously scored items in addition to, or in place of, the more traditional 
correct/incorrect item formats, requires reconsideration of some of the psychometric procedures 
that are specific to the dichotomous situation. In particular, the identification of differential item 
functioning or DIF within each of J categories of a polytomously scored item requires either 
modifications of procedures that are currently used for dichotomous items, or the creation of new 
procedures that are especially suited for multiple-category item scoring. Several extensions of 
the existing Mantel-Haenszel procedure, a popular method for identifying DIF in dichotomous 
items, have been suggested for the polytomous case. These extended Mantel-Haenszel procedures 
are similar to those used in the dichotomous situation for 0/1 item responses which have been 
tabulated in a 2 X 2 X K table, in that they assume that there is no three-way interaction. In 
other words, nonuniform DIF is assumed not to exist. The only way that this assumption can 
be tested is if a procedure is used that allows for a specific test of the presence of the three-way 
interaction. Examples include tests of significance of the interaction term in the fitting of a log- 
linear model, or of the interaction coefficient in a logistic regression model (Swaminathan & 
Rogers, 1990). 

The identification of nonuniform DIF might be more important in a polytomous item than 
in a dichotomous one because there are potentially more ways in which the group-by-response- 
by-score interaction can manifest itself in the polytomous situation. For example, it is possible 
that in addition to the usual nonuniform DIF situation in which the proportion of examinees in 
a group with some response, U = u, varies as a function of the conditioning score, one could 
have the situation where the proportion remains constant throughout the score scale but reverses 
group direction for different item response categories. Although this is not the typical way in 



which nonuniform DIF occurs, its detection is still important. Any useful polytomous DIF 
procedure should be able to detect such occurrences with sufficiently high power. 

Another method proposed to detect situations of nonuniform DIF in polytomous items is 
called logistic discriminant function analysis (LDFA). This method has recently been suggested 
as a useful procedure for the identification of DIF (both uniform and nonuniform cases) in 
polytomous items (Miller & Spray, 1993). The method is similar to those mentioned previously 
(i.e., log-linear modeling and logistic regression) in that a separate test of the significance of the 
interaction is available. However, the LDFA method is much easier to implement than the 
logistic regression for the polytomous case (Miller & Spray, 1993). The method is identical to 
some log-linear modeling approaches (Hanson, 1992), but may be easier to interpret because of 
graphical procedures which can be used post hoc to investigate the direction and magnitude of 
the DIF visually (Miller & Spray, 1993). 

Although they lack separate tests of any possible interaction, several extensions of the 
Mantel-Haenszel procedure are available for DIF identification in polytomous items, depending 
upon whether the responses can be treated as nominal or ordinal. Mantel and Haenszel (1959) 
extended the 2 X 2 X K situation to the 2 X 3 X K case with 3 nominal levels of response, and 
showed that a summary chi-squared statistic with 2 degrees of freedom could be obtained (pp. 
743-745). The authors also gave approximations for the more general 2 X J X K situation, 
where there are J nominal response levels. Agresti (1990) later summarized work which gave 
exact, rather than approximate, procedures for the more general I X J X K case. 

Mantel (1963) later proposed an extension whereby the J responses are scored or weighted 
by ordered scores. Mantel showed that the summary score statistic was simply the weighted sum 
of the J frequencies, weighted by the J scores at each of the K levels. This amounted to testing 



a null hypothesis about the mean level of the J responses, so that the summary statistic was tested 
with a single degree of freedom only (Mantel, 1963). This score statistic was extended by 
Landis, Heyman, and Koch (1978) to the I X J X K situation where either the J response levels, 
the I levels, or both are ordered and ordinal scores can be assigned to the responses. A 
convenient vector representation of this situation is provided by Agresti (1990, p. 2X6). 

In the 2 X 2 X K situation with dichotomous items, the Mantel-Haenszel procedure often 
is quite robust in detecting DIF, even when there is a serious violation of the assumption of no 
three-way interaction. Therefore, the purpose of this paper was to report a series of computer 
simulations in which different types of DIF were present in simulated polytomous item responses. 
Three procedures were then used to detect the presence of DIF. The procedures were compared 
on the basis of their ability to detect true DIF when it existed (i.e., statistical power) and their 
tendency to detect it when it did not exist (i.e., Type I error). The procedures used in the simulations were (1) the 
extended Mantel-Haenszel test on nominal data with J-1 degrees of freedom, (2) the Mantel 
score statistic on ordinal data with one degree of freedom, and (3) the LDFA procedure. Each 
procedure is briefly described below. 

Logistic Discriminant Function Analysis 
The logistic discriminant function, which is estimated via the LDFA procedure, can be 
written as

$$\mathrm{Prob}(G \mid X, U) = \frac{e^{(1-G)(-\alpha_0 - \alpha_1 X - \alpha_2 U - \alpha_3 XU)}}{1 + e^{-\alpha_0 - \alpha_1 X - \alpha_2 U - \alpha_3 XU}}, \tag{1}$$

where the $\alpha_i$, i = 0, 1, 2, 3, are the discriminant function coefficients to be estimated and G is 
a Group indicator variable where, for example, G = 1 for the Reference (R) group and G = 0 for 
the Focal (F) group. X is the conditioning score, and U is the item response variable that can take 
on any one of the J values associated with each item. 

Tests of significance of the coefficients $\alpha_3$ and $\alpha_2$ provide answers to the questions 
concerning nonuniform and uniform DIF, respectively. Specifically, the significance of $\alpha_3$ is 
tested by first fitting the hierarchical model given by 

$$\mathrm{Prob}(G \mid X, U) = \frac{e^{(1-G)(-\alpha_0 - \alpha_1 X - \alpha_2 U)}}{1 + e^{-\alpha_0 - \alpha_1 X - \alpha_2 U}}. \tag{2}$$

The difference in the log of the likelihood functions obtained from Equations 2 and 1 is used to 
test for nonuniform DIF or the significance of $\alpha_3$. The significance of $\alpha_2$ is tested by next fitting 
the null model, given by 

$$\mathrm{Prob}(G \mid X, U) = \mathrm{Prob}(G \mid X) = \frac{e^{(1-G)(-\alpha_0 - \alpha_1 X)}}{1 + e^{-\alpha_0 - \alpha_1 X}}. \tag{3}$$

Equation 3 is termed the null model because it represents the probability of group 
membership only as a function of group sample sizes and group distributions on X. The item 
response variable is ignored. Thus, the null model given by Equation 3 remains constant from 
item to item. The difference in the log of the likelihood functions obtained from Equations 3 and 
2 is used to test for uniform DIF or the significance of $\alpha_2$. 

Each difference in the log likelihood functions is asymptotically distributed as a chi- 
squared random variable with one degree of freedom. Thus, with the LDFA procedure, two 
separate tests can be performed for nonuniform and uniform DIF. The nonuniform DIF test can 
also be thought of as a test of the no-three-way interaction assumption. 
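
The two LDFA tests can be illustrated with a short sketch. It is not the authors' program: it assumes the three nested models are fitted by ordinary Newton-Raphson maximum likelihood, it uses twice the difference in maximized log likelihoods (the usual likelihood-ratio form) as the chi-squared statistic, and the function names and the generated demonstration data are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def prob_ref(beta, row):
    """Prob(G = 1 | predictors) for a logistic model with coefficients beta."""
    eta = sum(b * v for b, v in zip(beta, row))
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, eta))))

def max_loglik(rows, g, iters=50):
    """Maximized log likelihood, found by Newton-Raphson iterations."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for row, gi in zip(rows, g):
            mu = prob_ref(beta, row)
            for i in range(p):
                grad[i] += (gi - mu) * row[i]
                for j in range(p):
                    hess[i][j] += mu * (1.0 - mu) * row[i] * row[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return sum(math.log(prob_ref(beta, row) if gi == 1 else 1.0 - prob_ref(beta, row))
               for row, gi in zip(rows, g))

def ldfa_tests(x, u, g):
    """Return (chi-squared, p) for the nonuniform and uniform DIF tests, 1 df each."""
    full = [[1.0, xi, ui, xi * ui] for xi, ui in zip(x, u)]   # Equation 1
    main = [r[:3] for r in full]                              # Equation 2
    null = [r[:2] for r in full]                              # Equation 3
    ll_full, ll_main, ll_null = (max_loglik(d, g) for d in (full, main, null))
    p_1df = lambda c: math.erfc(math.sqrt(max(c, 0.0) / 2.0))
    chi_non = 2.0 * (ll_full - ll_main)
    chi_uni = 2.0 * (ll_main - ll_null)
    return {"nonuniform": (chi_non, p_1df(chi_non)), "uniform": (chi_uni, p_1df(chi_uni))}

# Hypothetical data: the item is uniformly easier for the reference group (G = 1).
random.seed(1)
x, u, g = [], [], []
for group in (1, 0):
    for _ in range(400):
        theta = random.gauss(0.0, 1.0)
        x.append(theta + random.gauss(0.0, 0.5))      # stand-in conditioning score X
        latent = theta + (0.8 if group == 1 else 0.0) + random.gauss(0.0, 0.5)
        u.append(sum(latent > cut for cut in (-1.0, 0.0, 1.0)))  # response in 0..3
        g.append(group)
result = ldfa_tests(x, u, g)
```

Because the three models are nested, both statistics are nonnegative apart from rounding, and each is referred to a chi-squared distribution with one degree of freedom.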

Mantel-Haenszel Extensions 

For both extensions described below, the data are assumed to be tabulated in a 2 X J X 
K table (i.e., 2 groups by J responses by K levels of the conditioning or matching variable). 

Nominal Case 

The observed counts or absolute frequencies in J-1 cells for the R group across K levels 
are denoted by $\mathbf{n}_k = (n_{R1k}, n_{R2k}, \ldots, n_{R,J-1,k})'$. The expected frequencies under the hypothesis of 
conditional independence (i.e., no uniform DIF) are 
$\mathbf{m}_k = (n_{R+k} n_{+1k}, n_{R+k} n_{+2k}, \ldots, n_{R+k} n_{+,J-1,k})' / n_{++k}$. $\mathbf{V}_k$ 
denotes the null covariance matrix of $\mathbf{n}_k$ (see Agresti, 1990, p. 234). Summing over the K strata 
gives $\mathbf{n} = \sum \mathbf{n}_k$, $\mathbf{m} = \sum \mathbf{m}_k$, and $\mathbf{V} = \sum \mathbf{V}_k$. Then the nominal version of the extended Mantel-Haenszel 
statistic is given by 

$$MH_{nom} = (\mathbf{n} - \mathbf{m})' \mathbf{V}^{-1} (\mathbf{n} - \mathbf{m}).$$

This statistic has a large-sample chi-squared distribution with J-1 degrees of freedom under the 
null hypothesis of conditional independence. A significant test implies that uniform DIF is 
present in the item. 
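
A minimal sketch of the nominal statistic follows. It assumes the standard multiple hypergeometric null covariance for the R-group counts in each stratum and that every one of the first J-1 response categories is observed in at least one stratum (otherwise V is singular). The function names and example counts are hypothetical.

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def mh_nominal(strata):
    """Extended Mantel-Haenszel statistic for a 2 X J X K table.

    strata: list of (ref_counts, focal_counts) pairs, one per level of the
    matching variable; each element is a list of J category counts.
    Returns (statistic, degrees of freedom).
    """
    J = len(strata[0][0])
    d = [0.0] * (J - 1)                        # n - m, summed over strata
    v = [[0.0] * (J - 1) for _ in range(J - 1)]
    for ref, foc in strata:
        n_r, n_f = sum(ref), sum(foc)
        n = n_r + n_f
        if n < 2 or n_r == 0 or n_f == 0:
            continue                           # stratum carries no information
        col = [r + f for r, f in zip(ref, foc)]
        scale = n_r * n_f / (n * n * (n - 1.0))
        for j in range(J - 1):
            d[j] += ref[j] - n_r * col[j] / n
            for h in range(J - 1):
                v[j][h] += scale * col[j] * ((n if j == h else 0.0) - col[h])
    y = solve(v, d)                            # y = V^{-1} (n - m)
    return sum(a * b for a, b in zip(d, y)), J - 1

# One stratum in which the two groups differ in opposite directions by category.
stat, df = mh_nominal([([20, 10, 10, 20], [10, 20, 20, 10])])
```

With J = 2 the function reduces to the familiar dichotomous Mantel-Haenszel chi-squared statistic without the continuity correction.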
Ordinal Case 

The observed counts or absolute frequencies in the kth level are denoted by 
$\mathbf{n}_k = (n_{R1k}, n_{R2k}, \ldots, n_{RJk}, n_{F1k}, \ldots, n_{FJk})'$. The expected frequencies under the hypothesis of conditional 
independence (i.e., no uniform DIF) are $\mathbf{m}_k$. $\mathbf{V}_k$ denotes the null covariance matrix of $\mathbf{n}_k$. Also, 
let $\mathbf{u} = (u_1, u_2, \ldots, u_J)'$, a vector of response category scores, such as 1, 2, ..., J. The scores will 
usually correspond to the values assigned to the scoring of the item. Then, let $\mathbf{B}_k$ denote a vector 
of length 2J of score constants formed from the response category scores. The ordinal or scored 
version of the extended Mantel-Haenszel statistic is given (Agresti, 1990) by 

$$MH_{ord} = \left[\sum \mathbf{B}_k(\mathbf{n}_k - \mathbf{m}_k)\right]' \left[\sum \mathbf{B}_k \mathbf{V}_k \mathbf{B}_k'\right]^{-1} \left[\sum \mathbf{B}_k(\mathbf{n}_k - \mathbf{m}_k)\right],$$

where the summation is over k. This statistic has a large-sample chi-squared distribution with 
1 degree of freedom under the null hypothesis of conditional independence (i.e., no uniform DIF). 
A significant test implies that uniform DIF is present in the item. A simpler, but equivalent, 
algebraic representation of Mantel's score statistic for ordinal responses is given by Mantel (1963, 
p. 694). 
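
The scalar form of the score statistic can be sketched as follows, assuming the usual expression in which the R-group sum of scores is compared with its expectation and hypergeometric variance within each level of the matching variable. The function name and example counts are hypothetical.

```python
import math

def mantel_ordinal(strata, scores):
    """Mantel score statistic (1 df) for a 2 X J X K table.

    strata: list of (ref_counts, focal_counts) pairs, one per matching level.
    scores: the J ordered category scores, e.g. [1, 2, ..., J].
    Returns (chi-squared, p-value).
    """
    t = e = var = 0.0
    for ref, foc in strata:
        n_r, n_f = sum(ref), sum(foc)
        n = n_r + n_f
        if n < 2:
            continue
        col = [r + f for r, f in zip(ref, foc)]
        s1 = sum(s * c for s, c in zip(scores, col))       # sum of scores
        s2 = sum(s * s * c for s, c in zip(scores, col))   # sum of squared scores
        t += sum(s * r for s, r in zip(scores, ref))       # observed R-group sum
        e += n_r * s1 / n                                  # its null expectation
        var += n_r * n_f * (n * s2 - s1 * s1) / (n * n * (n - 1.0))
    chi2 = (t - e) ** 2 / var
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# The groups differ in every category, but their mean scores are equal (2.5).
chi2, p = mantel_ordinal([([20, 10, 10, 20], [10, 20, 20, 10])], [1, 2, 3, 4])
```

In this example the statistic is zero even though the two response distributions clearly differ, which shows why a test on the mean response level cannot register group differences that cancel across categories.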

Method 

The Simulations 

Item responses were generated from Muraki's generalized partial credit model (Muraki, 
1991), which gives the item-response density functions or item category characteristic curves 
(ICCCs) as functions of a unidimensional latent ability, $\theta$. This model can be written as 

$$\mathrm{Prob}(U = u_j \mid \theta) = \frac{\exp\left[\sum_{v=1}^{j} a(\theta - b_v)\right]}{\sum_{c=1}^{J} \exp\left[\sum_{v=1}^{c} a(\theta - b_v)\right]}, \tag{4}$$

where the $b_j$ parameters define the points of intersection of the adjoining ICCCs and a 
represents a slope parameter relating to the discriminating power of the item. According to 
Muraki, "... the discriminating power of each ICCC depends on the combination of the slope and 
threshold parameters" (p. 7). Thus, it is possible to have several different levels of discriminating 
power for the different item responses within the same test item. 
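
As an illustration of Equation 4, the sketch below computes the category probabilities for one item and draws a single response. It assumes the cumulative-sum form of the generalized partial credit model, in which the first threshold is arbitrary because it cancels from the ratio. The function names are hypothetical, and the parameter values are those reported for the no-DIF items.

```python
import math
import random

def gpcm_probs(theta, a, b):
    """Category probabilities under the generalized partial credit model.

    b is the list of J threshold parameters; b[0] is arbitrary because its
    term appears in every category and cancels from the ratio.
    """
    z, total = [], 0.0
    for b_v in b:
        total += a * (theta - b_v)      # running sum of a(theta - b_v)
        z.append(total)
    top = max(z)                        # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in z]
    s = sum(w)
    return [v / s for v in w]

def draw_response(theta, a, b, rng=random):
    """Draw one response in 1..J with the model probabilities."""
    p = gpcm_probs(theta, a, b)
    return rng.choices(range(1, len(p) + 1), weights=p)[0]

b_params = [0.0, -1.0, 0.5, 1.0]
probs_at_zero = gpcm_probs(0.0, 1.0, b_params)
probs_at_b2 = gpcm_probs(-1.0, 1.0, b_params)   # categories 1 and 2 intersect here
random.seed(7)
one_response = draw_response(random.gauss(0.0, 1.0), 1.0, b_params)
```

Evaluating the function at $\theta = b_2$ gives equal probabilities for the first two categories, which agrees with the interpretation of the $b_j$ as intersection points of adjoining ICCCs.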

There were 20 items on the simulated tests. Only the last item, item #20, had simulated 
DIF. The remaining 19 items had identical item parameters for the two groups. These 
parameters were a = 1.0, $b_1$ = .00, $b_2$ = -1.00, $b_3$ = .5, and $b_4$ = 1.00. Two sample sizes were 
used for each group: 500 and 2000. Ability populations were assumed to be identical for both 
the focal (F) and reference (R) populations. Ability (i.e., $\theta$) sampling was simulated from a 
standard normal distribution. 

There were three DIF conditions simulated. The first condition was a simple uniform DIF 
case where the a-parameters for both groups remained the same but the b-parameters were shifted 
or offset by a constant amount. For Condition 1, the R group parameters were 
{a = 1.0, $b_1$ = 0.0, $b_2$ = -1.0, $b_3$ = .5, $b_4$ = 1.0}, while the F group parameters were 
{a = 1.0, $b_1$ = 0.0, $b_2$ = -.75, $b_3$ = .75, $b_4$ = 1.25}. In 
other words, the item was consistently more difficult for each response category for members of 
the F group than for comparable members of the R group. Figure 1 illustrates the ICCCs for this 
item. Response probabilities for the F group are plotted as dotted lines. 

see Figure 1 at end of report 



For the second condition, nonuniform DIF was simulated where the a-parameters for each 
group varied but the b-parameters remained the same. For Condition 2, the R group parameters 
were {a = 1.0, $b_1$ = 0.0, $b_2$ = -1.0, $b_3$ = .5, $b_4$ = 1.0}, while the F group parameters were 
{a = .5, $b_1$ = 0.0, $b_2$ = -1.0, $b_3$ = .5, $b_4$ = 1.0}. This item was more discriminating for the R group for all response categories. 
See Figure 2. 

see Figure 2 at end of report 



For the last condition, a less traditional type of nonuniform DIF was simulated. In this 
case, the a-parameters for each group once again remained the same and only two of the 
b-parameters varied, but in different directions. For Condition 3, the R group parameters were 
{a = 1.0, $b_1$ = 0.0, $b_2$ = -.75, $b_3$ = .75, $b_4$ = 2.0}, while the F group parameters were 
{a = 1.0, $b_1$ = 0.0, $b_2$ = .75, $b_3$ = -.75, $b_4$ = 2.0}. This item was therefore easier for the R group for the 
second category but more difficult for the third category. See Figure 3. 



see Figure 3 at end of report 



One hundred replications were performed for each of the two sample sizes and for each 
of the three DIF conditions. A test was significant if the null hypothesis was rejected at a 
probability level that was less than .05/20 or .0025. Power was computed as the number of 
replications, out of a possible 100, that a significant test was observed for item #20. A type I 
error rate was computed as the number of replications, out of a possible 100, that a significant 
test was observed for items #1-#19. The summary error rate was the average error rate over 
those 19 no-DIF items. 
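
The decision rule and the tallying of rejections can be sketched as below. The closed-form chi-squared tail probabilities for 1 and 3 degrees of freedom are standard results; the function names, the bisection search for the critical values, and the example p-values are hypothetical illustrations rather than part of the original simulations.

```python
import math

ALPHA = 0.05 / 20   # per-item significance level used for the 20-item test

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable (df = 1 or 3 only)."""
    tail = math.erfc(math.sqrt(x / 2.0))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 3 are implemented in this sketch")

def critical_value(df, alpha=ALPHA):
    """Chi-squared value whose upper-tail probability equals alpha (bisection)."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def rejection_rate(p_values, alpha=ALPHA):
    """Proportion of replications in which the null hypothesis was rejected."""
    return sum(p < alpha for p in p_values) / len(p_values)

crit_1df = critical_value(1)   # applies to MH_ord and both LDFA tests
crit_3df = critical_value(3)   # applies to MH_nom with J = 4 categories
example_rate = rejection_rate([0.001, 0.5, 0.002, 0.01])
```

Applied to the p-values for item #20, `rejection_rate` gives the power estimate; applied to each of items #1-#19 and then averaged, it gives the summary type I error rate.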

Results 

The results of the simulations are summarized in Tables 1 and 2. Table 1 gives estimates 
of power for Item #20 for each of the three different DIF procedures, along with the average 
chi-squared statistic. For Condition 1, where the item was consistently more difficult for each 
response category for members of the F group than for comparable members of the R group, all 
three of the DIF procedures identified the item as having uniform DIF with similar power. For 
the smaller sample size of 500, the nominal form of the MH was less powerful than the ordinal 
MH extension. However, at the larger sample size of 2000, all of the procedures yielded high 
power estimates. The LDFA test for nonuniform DIF was nonsignificant, as it should have been 
for this DIF condition. 

see Table 1 at end of report 



For Condition 2, where item #20 was more discriminating for the R group for all response 
categories and the traditional nonuniform DIF was present, the LDFA test for nonuniform DIF 
showed moderate power for a sample size of 500 and higher power at the larger sample size. 
Two of the three uniform DIF tests (MH_ord and LDFA) showed very low power to detect this 
type of nonuniform DIF, as was to be expected. However, the MH_nom procedure showed 
moderate power (.30) in identifying this traditional nonuniform DIF at the larger sample size 
(2000). See Table 1. 

For the third condition, where item #20 was easier for the R group 
for the second category but more difficult for the third, the MH_nom procedure showed very high 
power to detect this type of nonuniform DIF even with the smaller sample size. The LDFA 
nonuniform DIF test had a low-to-moderate degree of power at the same sample size. Both the 
LDFA nonuniform test and the MH_nom demonstrated a high degree of estimated power for DIF 
identification at the larger sample size. Both the MH_ord and the uniform test of the LDFA 
procedure failed to identify this DIF situation in item #20. 

Table 2 gives estimates of average type I error for Items #1-#19 for each of the three 
different DIF procedures for the three DIF conditions. Recall that the nominal $\alpha$ level for these 
simulations was .0025. Table 2 shows that, with the exception of the LDFA nonuniform test for 
Condition 2, estimated type I error rates were within reasonable ranges of the nominal level for 
all procedures, for all sample sizes, and under all DIF conditions. 

see Table 2 at end of report 



Discussion and Conclusions 
These simulations showed that the LDFA procedure was capable of identifying simulated 
DIF, both uniform and nonuniform, in polytomous items with a high degree of power. The 
procedure could also distinguish between uniform and nonuniform DIF. The only instance where 
the performance of the LDFA procedure was surpassed by another procedure was the condition 
simulated by Condition 3 when the sample sizes were fairly small. In this instance, the MH_nom 
procedure was much more sensitive to the directional change across response categories. 
However, with a larger sample size, the LDFA procedure also identified this type of nonuniform 
DIF accurately. The fact that the MH_nom procedure could not identify the type of nonuniform 
DIF simulated in Condition 2, even with fairly large samples of 2000 in each group, would 
suggest that it might not be the best procedure to use if the identification of such DIF is 
important. The MH_ord statistic was not accurate in identifying true DIF except in the uniform 
DIF situation. Even then, the LDFA approach was equally powerful in uncovering this type of 
DIF. Therefore, when fairly large sample sizes are available (i.e., N > 500), it is recommended 
that the LDFA procedure be used for DIF identification with polytomously scored test items. 

Table 1 

Power Results, Item #20 

Condition   Sample Size   Procedure                   Power    Average χ²

1           500           MH_nom (3 df)                .350       13.016
                          MH_ord (1 df)                .570       11.390
                          LDFA (uniform) (1 df)        .600       12.136
                          LDFA (nonuniform) (1 df)     .000        1.111

            2000          MH_nom (3 df)               1.000       40.172
                          MH_ord (1 df)               1.000       38.177
                          LDFA (uniform) (1 df)       1.000       38.715
                          LDFA (nonuniform) (1 df)     .000        0.667

2           500           MH_nom (3 df)                .040        5.467
                          MH_ord (1 df)                .020        1.869
                          LDFA (uniform) (1 df)        .020        1.868
                          LDFA (nonuniform) (1 df)     .440        9.686

            2000          MH_nom (3 df)                .300       11.693
                          MH_ord (1 df)                .070        2.737
                          LDFA (uniform) (1 df)        .060        2.692
                          LDFA (nonuniform) (1 df)    1.000       31.196

3           500           MH_nom (3 df)               1.000       88.951
                          MH_ord (1 df)                .000        1.393
                          LDFA (uniform) (1 df)        .000        1.578
                          LDFA (nonuniform) (1 df)     .370        8.857

            2000          MH_nom (3 df)               1.000      367.159
                          MH_ord (1 df)                .000        1.945
                          LDFA (uniform) (1 df)        .000        1.937
                          LDFA (nonuniform) (1 df)    1.000       31.797

Table 2 

Type I Error Results, Items #1-#19 

Condition   Sample Size   Procedure                   Error    Average χ²

1           500           MH_nom (3 df)                .003        3.0X0
                          MH_ord (1 df)                .005        1.056
                          LDFA (uniform) (1 df)        .004        1.043
                          LDFA (nonuniform) (1 df)     .004        1.106

            2000          MH_nom (3 df)                .004        3.146
                          MH_ord (1 df)                .004        1.124
                          LDFA (uniform) (1 df)        .004        1.121
                          LDFA (nonuniform) (1 df)     .003        1.106

2           500           MH_nom (3 df)                .002        3.023
                          MH_ord (1 df)                .002        1.106
                          LDFA (uniform) (1 df)        .002         .993
                          LDFA (nonuniform) (1 df)     .002         .970

            2000          MH_nom (3 df)                .002        2.977
                          MH_ord (1 df)                .002         .996
                          LDFA (uniform) (1 df)        .002        1.001
                          LDFA (nonuniform) (1 df)     .007        1.237

3           500           MH_nom (3 df)                .005        3.045
                          MH_ord (1 df)                .005        1.019
                          LDFA (uniform) (1 df)        .005        1.012
                          LDFA (nonuniform) (1 df)     .005        1.018

            2000          MH_nom (3 df)                .003        3.018
                          MH_ord (1 df)                .002         .977
                          LDFA (uniform) (1 df)        .002         .979
                          LDFA (nonuniform) (1 df)     .001        1.009

Figure 1 ICCCs for Item #20, Condition 1 

Figure 2 ICCCs for Item #20, Condition 2 

Figure 3 ICCCs for Item #20, Condition 3 